Statistical Learning Theory, Lecture 22 (Nov 7, 2013): Stochastic Multi-Armed Bandits

Authors

  • Shivani Agarwal
  • Arpit Agarwal
Abstract

In this lecture we will start to look at the multi-armed bandit (MAB) problem, which can be viewed as a form of online learning in which the learner receives only partial information at the end of each trial. Specifically, in an n-armed bandit problem, there are n different options that can be chosen, or n different arms that can be pulled, on each trial; each of these arms, when pulled, generates some reward (or loss). On each trial, the learner selects one arm to pull, and observes the reward associated with only the pulled arm; the rewards associated with the other arms remain hidden from the learner (it is in this sense that the learner receives only partial information). The goal of the learner is to accumulate as much reward as possible over a sequence of trials, e.g. compared to the best fixed arm in hindsight. Such problems arise in a variety of applications, e.g. in clinical trials, where a doctor must select one of n available treatments to give to each incoming patient, and gets to observe the outcome of only the particular treatment delivered to that patient; in ad placement, where one must decide which of n available ads to display to a given user, and gets to observe the click behavior/revenue generated only for the ad displayed; and so on. In all these applications, there is an inherent trade-off between exploration (trying a new arm that might yield better rewards) and exploitation (selecting an arm that has been observed to give good rewards so far), and any good MAB algorithm must somehow balance these two aspects.
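To make the exploration/exploitation trade-off concrete, the sketch below shows one simple and well-known strategy, epsilon-greedy, for a stochastic n-armed bandit with rewards in [0, 1]. It is an illustrative example only, not an algorithm taken from the lecture; the pull function and the Bernoulli arms in the usage example are hypothetical stand-ins for the environment.

```python
import random

def epsilon_greedy(pull, n_arms, T, eps=0.1):
    """Illustrative epsilon-greedy strategy for a stochastic n-armed bandit.

    pull(i) is assumed to return a stochastic reward in [0, 1] for arm i;
    only the pulled arm's reward is ever observed (partial information).
    """
    counts = [0] * n_arms           # number of times each arm has been pulled
    means = [0.0] * n_arms          # empirical mean reward of each arm
    total = 0.0
    for t in range(T):
        if random.random() < eps:   # explore: pull a uniformly random arm
            i = random.randrange(n_arms)
        else:                       # exploit: pull the best empirical arm so far
            i = max(range(n_arms), key=lambda a: means[a])
        r = pull(i)                 # reward of the pulled arm is the only observation
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
        total += r
    return total

# Hypothetical usage: three Bernoulli arms whose true means are unknown to the learner.
if __name__ == "__main__":
    true_means = [0.2, 0.5, 0.7]
    reward = lambda i: 1.0 if random.random() < true_means[i] else 0.0
    print(epsilon_greedy(reward, n_arms=3, T=10000))
```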


Similar resources

Generic Exploration and K-armed Voting Bandits

We study a stochastic online learning scheme with partial feedback, where the utility of decisions is only observable through an estimation of the environment parameters. We propose a generic pure-exploration algorithm able to cope with various utility functions, from multi-armed bandit settings to dueling bandits. The primary application of this setting is to offer a natural generalization of ...


Linear Bayes policy for learning in contextual-bandits

Machine and Statistical Learning techniques are used in almost all online advertisement systems. The problem of discovering which content is more in demand (e.g. receives more clicks) can be modeled as a multi-armed bandit problem. Contextual bandits (i.e. bandits with covariates, side information or associative reinforcement learning) associate, to each specific content, several features that de...


Lecture 18: Stochastic Bandits

Last time we talked about the nonstochastic bandit problem, which was a partial-information version of our online learning problem. There we studied situations where at each iteration t, the learner chooses an action a_t and suffers a loss ℓ_t(a_t), which is the only thing the learner observes. We showed that the importance weighting trick can be plugged into any full information algorithm with a loca...
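As a rough sketch of the importance weighting trick referred to above (an illustration under standard assumptions, not code from the cited lecture): when the learner samples its action a_t from a distribution p_t over the n actions and observes only the loss of that action, dividing the observed loss by the probability of the sampled action yields an unbiased estimate of the entire loss vector, which can then be fed to a full-information algorithm.

```python
import random

def importance_weighted_losses(p, losses):
    """Build an unbiased estimate of the full loss vector from bandit feedback.

    p      : sampling distribution over the n actions (sums to 1)
    losses : true (hidden) losses of all actions this round; only losses[a_t]
             is actually revealed to the learner.
    Returns (a_t, hat_l), where E[hat_l[a]] = losses[a] for every action a,
    since p[a] * (losses[a] / p[a]) = losses[a].
    """
    n = len(p)
    a_t = random.choices(range(n), weights=p)[0]  # sample the action to play
    observed = losses[a_t]                        # the only value the learner sees
    hat_l = [0.0] * n
    hat_l[a_t] = observed / p[a_t]                # importance-weighted estimate
    return a_t, hat_l
```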


Generalized Risk-Aversion in Stochastic Multi-Armed Bandits

We consider the problem of minimizing the regret in stochastic multi-armed bandits when the measure of goodness of an arm is not the mean return but some general function of the mean and the variance. We characterize the conditions under which learning is possible and present examples for which no natural algorithm can achieve sublinear regret.


Reducing Dueling Bandits to Cardinal Bandits

We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form “A is preferred to B” (as opposed to cardinal feedback like “A has value 2.5”), giving it wide applicability in learning from implicit user feedback and revealed and stated prefer...




Publication date: 2013